-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48481: [Ruby] Correctly infer types for nested integer arrays #48699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
|
Benchmark code require "benchmark"
require "arrow"
SIZES = [100_000].freeze
# Pre-generate arrays so we only measure builder cost.
DATASETS = SIZES.each_with_object({}) do |size, datasets|
datasets[["int32", size]] = Array.new(size) {|i| (i % 2_000_000) - 1_000_000 }
datasets[["int64", size]] = Array.new(size) {|i| (i % 2**40) - 2**39 }
datasets[["uint32", size]] = Array.new(size) {|i| i % 4_000_000_000 }
datasets[["uint64", size]] = Array.new(size) {|i| i % 2**40 }
end.freeze
Benchmark.bm(36) do |x|
DATASETS.each do |(name, size), values|
x.report(" array_builder #{name} #{size}") do
Arrow::ArrayBuilder.build(values)
end
next unless name.start_with?("int", "uint")
builder_class = name.start_with?("uint") ? Arrow::UIntArrayBuilder : Arrow::IntArrayBuilder
x.report("int_array_builder #{name} #{size}") do
builder = builder_class.new
builder.build(values)
end
end
end |
| builder_info[:scale]) | ||
| Decimal256ArrayBuilder.new(data_type) | ||
| when :int | ||
| required_bit_length = builder_info[:bit_length] + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need +1 only for :int?
| [3, 4], | ||
| ] | ||
| array = Arrow::Array.new(values) | ||
| data_type = Arrow::ListDataType.new(Arrow::UInt8DataType.new) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simplify this:
| data_type = Arrow::ListDataType.new(Arrow::UInt8DataType.new) | |
| data_type = Arrow::ListDataType.new(:uint8) |
| data_type = Arrow::ListDataType.new(Arrow::Int8DataType.new) | ||
| assert_equal({ | ||
| data_type: data_type, | ||
| values: [ | ||
| [0, -1, 2], | ||
| [3, 4], | ||
| ], | ||
| }, | ||
| { | ||
| data_type: array.value_data_type, | ||
| values: array.to_a, | ||
| }) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this test because list<int8>s boundary test covers this case?
| # Int8 can hold values from -128 to 127. | ||
| values = [ | ||
| [0, -2**7], | ||
| [2**7 - 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using defined constants instead of comment?
| # Int8 can hold values from -128 to 127. | |
| values = [ | |
| [0, -2**7], | |
| [2**7 - 1], | |
| values = [ | |
| [0, GLib::MININT8], | |
| [GLib::MAXINT8], |
| [0, -2**7 - 1], | ||
| [2**7 - 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [0, -2**7 - 1], | |
| [2**7 - 1], | |
| [0, GLib::MININT8 - 1], | |
| [GLib::MAXINT8], |
| [0, -129], | ||
| [127], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [0, -129], | |
| [127], | |
| [0, GLib::MININT8 - 1], | |
| [GLib::MAXINT8], |
Rationale for this change
When building an
Arrow::Tablefrom a Ruby Hash passed toArrow::Table.new, nested Integer arrays are incorrectly inferred aslist<uint8>orlist<int8>regardless of the actual values contained. Nested integer arrays should be correctly inferred as the appropriate list type (e.g.,list<int64>,list<uint64>) based on their values, similar to how flat arrays are handled, unless they contain values out of range for any integer type.What changes are included in this PR?
This PR modifies the logic in
detect_builder_info()to fix the inference issue. Specifically:sub_builder_infoacross sub-array elements: Previously,sub_builder_infowas recreated for each sub-array element in the Array. The logic has been updated to accumulate and carry over the builder information across elements to ensure correct type inference for the entire list.BigDecimal, the logic for determining the Integer builder has been moved tocreate_builder().detect_builder_info()now calls this function.Note:
BigDecimal(which were previously inferred asstring) may now have their types inferred. However, comprehensive testing and verification for nestedBigDecimalsupport will be addressed in a separate issue to keep this PR focused.IntArrayBuilderfor inference logic to ensure correctness. This results in a performance overhead (array building is approximately 2x slower) as we can no longer rely on the specialized builder's detection.Are these changes tested?
Yes.
ruby ruby/red-arrow/test/run-test.rbAre there any user-facing changes?
Yes.